Convolution

As the name implies, convolution operations are an important component of convolutional neural networks. A CNN's ability to accurately match diverse patterns can be attributed to its use of convolution operations. These operations require complex input, as was shown in the previous section. In this section we'll experiment with convolution operations and the parameters that are available to tune them.

A convolution operation convolves two input tensors (an input and a kernel) into a single output tensor that represents information from both inputs.


Input and Kernel

In a typical situation, convolution operations in TensorFlow are done using tf.nn.conv2d. There are other convolution operations available in TensorFlow designed for special use cases, but tf.nn.conv2d is the preferred convolution operation to begin experimenting with. For example, we can experiment with convolving two tensors together and inspect the result.


In [1]:
# setup-only-ignore
import tensorflow as tf
import numpy as np

In [2]:
# setup-only-ignore
sess = tf.InteractiveSession()

In [3]:
input_batch = tf.constant([
        [  # First Input
            [[0.0], [1.0]],
            [[2.0], [3.0]]
        ],
        [  # Second Input
            [[2.0], [4.0]],
            [[6.0], [8.0]]
        ]
    ])

kernel = tf.constant([
        [
            [[1.0, 2.0]]
        ]
    ])

The example code creates two tensors. The input_batch tensor has a similar shape to the image_batch tensor seen in the previous section. This will be the first tensor being convolved, and the second tensor will be kernel. Kernel is an important term that is interchangeable with weights, filter, convolution matrix or mask. Since this task is computer vision related, it's useful to use the term kernel because it is being treated as an image kernel. There is no practical difference in these terms when used to describe this functionality in TensorFlow. The parameter in TensorFlow is named filter and it expects a set of weights which will be learned from training. The number of different weight sets included in the kernel (the filter parameter) configures the number of kernels that will be learned.

In the example code, there is a single kernel, which is the first dimension of the kernel variable. The kernel is built to return a tensor whose first channel holds the original input and whose second channel holds the original input doubled. In this case, channel is used to describe the elements in a rank 1 tensor (vector). Channel is a term from computer vision which describes the output vector; for example, an RGB image has three channels represented as a rank 1 tensor [red, green, blue]. For now, ignore the strides and padding parameters, which will be covered later, and focus on the convolution (tf.nn.conv2d) output.
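
For reference, tf.nn.conv2d expects the kernel (filter parameter) to have the shape [filter_height, filter_width, in_channels, out_channels]. A quick check of the two tensors defined above (added here for illustration, not part of the original example) makes the layout explicit.

In [ ]:
# input_batch is [batch, height, width, in_channels].
# kernel is [filter_height, filter_width, in_channels, out_channels].
print(input_batch.get_shape())  # (2, 2, 2, 1)
print(kernel.get_shape())       # (1, 1, 1, 2)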


In [4]:
conv2d = tf.nn.conv2d(input_batch, kernel, strides=[1, 1, 1, 1], padding='SAME')

sess.run(conv2d)


Out[4]:
array([[[[  0.,   0.],
         [  1.,   2.]],

        [[  2.,   4.],
         [  3.,   6.]]],


       [[[  2.,   4.],
         [  4.,   8.]],

        [[  6.,  12.],
         [  8.,  16.]]]], dtype=float32)

The output is another tensor of the same rank as input_batch, but its innermost dimension matches the number of output channels found in the kernel. Consider what it would mean if input_batch represented an image: the image would have a single channel, so in this case it could be considered a grayscale image (see Working with Colors). Each element in the tensor would represent one pixel of the image. The pixel in the bottom right corner of the image would have the value 3.0.

Consider the tf.nn.conv2d convolution operation as a combination of the image (represented as input_batch) and the kernel tensor. The convolution of these two tensors creates a feature map. Feature map is a broad term, but in computer vision it refers to the output of operations which work with an image kernel. The feature map represents the convolution of these tensors by adding new layers (channels) to the output.

The relationship between the input images and the output feature map can be explored with code. Elements of the input batch and of the feature map are accessed using the same index. Accessing the same pixel in both the input and the feature map shows how the input was changed when it was convolved with the kernel. In the following case, the lower right pixel in the image produced the values found by computing \\(3.0 \times 1.0\\) and \\(3.0 \times 2.0\\). The values correspond to the pixel value and the corresponding values found in the kernel.


In [5]:
lower_right_image_pixel = sess.run(input_batch)[0][1][1]
lower_right_kernel_pixel = sess.run(conv2d)[0][1][1]

lower_right_image_pixel, lower_right_kernel_pixel


Out[5]:
(array([ 3.], dtype=float32), array([ 3.,  6.], dtype=float32))

In this simplified example, each pixel of every image is multiplied by the corresponding value found in the kernel, and the result is added to a corresponding layer in the feature map. Layer, in this context, refers to a new dimension in the output. With an example this small, it's hard to see the value of convolution operations.
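
Because the kernel here is 1x1, the convolution reduces to a per-pixel multiplication by each of the kernel's output channel weights. A minimal check (illustrative, not part of the original example) confirms that multiplying every pixel by the two kernel weights reproduces the feature map exactly.

In [ ]:
# Multiply every pixel by the two kernel weights [1.0, 2.0] and compare
# against the feature map produced by tf.nn.conv2d.
manual_feature_map = sess.run(input_batch) * np.array([1.0, 2.0], dtype=np.float32)
np.allclose(manual_feature_map, sess.run(conv2d))  # => True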

Strides

The value of convolutions in computer vision is their ability to reduce the dimensionality of the input, which in this case is an image. An image's dimensionality (for a 2D image) is its width, height and number of channels. A large image dimensionality requires a substantially larger amount of time for a neural network to scan over every pixel and judge which ones are important. Reducing the dimensionality of an image with convolutions is done by altering the strides of the kernel.

The strides parameter causes a kernel to skip over pixels of an image and not include them in the output. It isn't quite fair to say the pixels are skipped, because they may still affect the output. The strides parameter highlights how a convolution operation works with a kernel when a larger image and a more complex kernel are used. As a convolution slides the kernel over the input, it uses the strides parameter to change how it walks over the input. Instead of going over every element of the input, the strides parameter can configure the convolution to skip certain elements.

For example, take the convolution of a larger image and a larger kernel. In this case, it's a convolution between an image 6 pixels tall, 6 pixels wide and 1 channel deep (6x6x1) and a (3x3x1) kernel.


In [6]:
input_batch = tf.constant([
        [  # First Input (6x6x1)
            [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]],
            [[0.1], [1.1], [2.1], [3.1], [4.1], [5.1]],
            [[0.2], [1.2], [2.2], [3.2], [4.2], [5.2]],
            [[0.3], [1.3], [2.3], [3.3], [4.3], [5.3]],
            [[0.4], [1.4], [2.4], [3.4], [4.4], [5.4]],
            [[0.5], [1.5], [2.5], [3.5], [4.5], [5.5]],
        ],
    ])

kernel = tf.constant([  # Kernel (3x3x1)
        [[[0.0]], [[0.5]], [[0.0]]],
        [[[0.0]], [[1.0]], [[0.0]]],
        [[[0.0]], [[0.5]], [[0.0]]]
    ])

# NOTE: the change in the values of the strides parameter.
conv2d = tf.nn.conv2d(input_batch, kernel, strides=[1, 3, 3, 1], padding='SAME')
sess.run(conv2d)


Out[6]:
array([[[[ 2.20000005],
         [ 8.19999981]],

        [[ 2.79999995],
         [ 8.80000019]]]], dtype=float32)

The input_batch was combined with the kernel by moving the kernel over the input_batch, striding (or skipping) over certain elements. Each time the kernel was moved, it was centered over an element of input_batch. Then the overlapping values were multiplied together and the products were summed. This is how a convolution combines two inputs using what's referred to as pointwise (element-wise) multiplication. It may be easier to visualize using the following figure.

In this figure, the same logic is followed as in the code: two tensors are convolved together while striding over the input. The strides reduced the dimensionality of the output by a large amount, while the kernel size allowed the convolution to use all of the input values. None of the input data was completely removed by striding, but the output is now a smaller tensor.

Strides are a way to adjust the dimensionality of input tensors. Reducing dimensionality requires less processing power and keeps receptive fields from completely overlapping. The strides parameter follows the same format as the input tensor: [image_batch_size_stride, image_height_stride, image_width_stride, image_channels_stride]. Changing the first or last element of the strides parameter is rare; doing so would skip data in a tf.nn.conv2d operation and not take that input into account. The image_height_stride and image_width_stride are useful to alter when reducing input dimensionality.
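
The effect of the height and width strides on the output shape is easy to see by comparing shapes (a short, illustrative comparison that isn't part of the original examples), using the 6x6x1 input and 3x3x1 kernel defined above.

In [ ]:
# Compare output shapes for different height/width strides (SAME padding).
for stride in [1, 2, 3]:
    strided = tf.nn.conv2d(input_batch, kernel,
                           strides=[1, stride, stride, 1], padding='SAME')
    print(strided.get_shape())
# stride 1 -> (1, 6, 6, 1), stride 2 -> (1, 3, 3, 1), stride 3 -> (1, 2, 2, 1)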

A challenge which comes up often with striding over the input is how to deal with a stride which doesn't evenly end at the edge of the input. Uneven striding arises when the image size and kernel size don't match up with the stride. If the image size, kernel size and strides can't be changed, then padding can be added to the image to deal with the uneven area.

Padding

When a kernel is overlaid on an image, it should be set to fit within the bounds of the image. At times, the sizing may not fit, and a good alternative is to fill the missing area of the image. Filling the missing area of the image is known as padding the image. TensorFlow will pad the image with zeros, or raise an error, when the sizes don't allow the kernel to stride over the image without going past its bounds. The amount of zeros, or the error state, of tf.nn.conv2d is controlled by the parameter padding, which has two possible values ('VALID', 'SAME').

SAME: The convolution output is the SAME size as the input. This doesn't take the filter's size into account when calculating how to stride over the image. This may stride over more of the image than exists within its bounds, padding all the missing values with zeros.

VALID: Takes the filter's size into account when calculating how to stride over the image. This keeps the kernel entirely inside the image's bounds, so no padding is added and the output may be smaller than the input.

It's best to consider the size of the input, but if padding is necessary then TensorFlow has the option built in. In most simple scenarios, SAME is a good choice to begin with. VALID is preferable when the input and kernel work well with the strides. For further information, TensorFlow covers this subject well in the convolution documentation.
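
As a rough sketch (added here for illustration), the output height and width are \\(\lceil H / stride \rceil\\) with SAME padding and \\(\lceil (H - F + 1) / stride \rceil\\) with VALID padding, where \\(H\\) is the input size and \\(F\\) the filter size. The difference is easy to see by comparing shapes using the 6x6 input and 3x3 kernel from above.

In [ ]:
# SAME: output size is ceil(6 / 2) = 3; VALID: ceil((6 - 3 + 1) / 2) = 2.
same_conv = tf.nn.conv2d(input_batch, kernel, strides=[1, 2, 2, 1], padding='SAME')
valid_conv = tf.nn.conv2d(input_batch, kernel, strides=[1, 2, 2, 1], padding='VALID')
print(same_conv.get_shape())   # (1, 3, 3, 1)
print(valid_conv.get_shape())  # (1, 2, 2, 1)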

Data Format

There's another parameter to tf.nn.conv2d, named data_format, which isn't shown in these examples. The tf.nn.conv2d docs explain how to change the data format so the input, kernel and strides follow a format other than the one being used thus far. Changing this format is useful if there is an input tensor which doesn't follow the [batch_size, height, width, channel] standard. Instead of changing the input to match, it's possible to change the data_format parameter to use a different layout.

data_format: An optional string from: "NHWC", "NCHW". Defaults to "NHWC". Specify the data format of the input and output data. With the default format "NHWC", the data is stored in the order of: [batch, in_height, in_width, in_channels]. Alternatively, the format could be "NCHW", the data storage order of: [batch, in_channels, in_height, in_width].

Data Format    Definition
N              Number of tensors in a batch, the batch_size.
H              Height of the tensors in each batch.
W              Width of the tensors in each batch.
C              Channels of the tensors in each batch.
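
As an illustrative sketch (not from the original text), an input stored channels-first can be used by passing data_format="NCHW" and reordering the strides to match; note that in the TensorFlow releases used here the NCHW layout is typically only supported on a GPU, so this op is built but not run.

In [ ]:
# Hypothetical channels-first example: convert the NHWC input from above
# into NCHW and reorder the strides accordingly ([batch, channels, height, width]).
# The kernel keeps its [filter_height, filter_width, in_channels, out_channels] layout.
nchw_input = tf.transpose(input_batch, [0, 3, 1, 2])
nchw_conv = tf.nn.conv2d(nchw_input, kernel,
                         strides=[1, 1, 3, 3], padding='SAME',
                         data_format='NCHW')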

Kernels in Depth

In TensorFlow, the filter parameter is used to specify the kernel convolved with the input. Filters are commonly used in photography to adjust attributes of a picture, such as the amount of sunlight allowed to reach a camera's lens. In photography, filters allow a photographer to drastically alter the picture they're taking. The reason the photographer is able to alter their picture using a filter is that the filter can recognize certain attributes of the light coming into the lens. For example, a red lens filter will absorb (block) every frequency of light which isn't red, allowing only red to pass through the filter.

In computer vision, kernels (filters) are used to recognize important attributes of a digital image. They do this by using certain patterns to highlight when features exist in an image. A kernel which replicates the red filter example is implemented by using a reduced value for all colors except red. In this case, the reds stay the same but all other colors are reduced.
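
A minimal sketch of such a kernel (illustrative values, not from the original text) is a 1x1 kernel that passes the red channel through unchanged while scaling the green and blue channels down. Convolving it with a single synthetic RGB pixel shows the effect.

In [ ]:
# Illustrative 1x1 kernel shaped [1, 1, 3, 3]: keep red, reduce green and blue.
red_filter_kernel = tf.constant([
        [
            [[1.0, 0.0, 0.0],    # red input passes through unchanged
             [0.0, 0.1, 0.0],    # green input is reduced
             [0.0, 0.0, 0.1]]    # blue input is reduced
        ]
    ])

# A single orange-ish pixel, shaped [batch, height, width, channels].
pixel = tf.constant([[[[255.0, 160.0, 60.0]]]])
sess.run(tf.nn.conv2d(pixel, red_filter_kernel, [1, 1, 1, 1], padding='SAME'))
# => [[[[255., 16., 6.]]]]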

The example seen at the start of this chapter uses a kernel designed to do edge detection. Edge detection kernels are common in computer vision applications and could be implemented using basic TensorFlow operations and a single tf.nn.conv2d operation.


In [7]:
# setup-only-ignore
import matplotlib as mil
#mil.use('svg')
mil.use("nbagg")
from matplotlib import pyplot
fig = pyplot.gcf()
fig.set_size_inches(4, 4)

image_filename = "./images/chapter-05-object-recognition-and-classification/convolution/n02113023_219.jpg"
image_filename = "/Users/erikerwitt/Downloads/images/n02085936-Maltese_dog/n02085936_804.jpg"
filename_queue = tf.train.string_input_producer(
    tf.train.match_filenames_once(image_filename))

image_reader = tf.WholeFileReader()
_, image_file = image_reader.read(filename_queue)
image = tf.image.decode_jpeg(image_file)

sess.run(tf.initialize_all_variables())
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)

image_batch = tf.image.convert_image_dtype(tf.expand_dims(image, 0), tf.float32, saturate=False)

In [8]:
kernel = tf.constant([
        [
            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]
        ],
        [
            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
            [[ 8., 0., 0.], [ 0., 8., 0.], [ 0., 0., 8.]],
            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]
        ],
        [
            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]
        ]
    ])


conv2d = tf.nn.conv2d(image_batch, kernel, [1, 1, 1, 1], padding="SAME")
activation_map = sess.run(tf.minimum(tf.nn.relu(conv2d), 255))

In [11]:
# setup-only-ignore
fig = pyplot.gcf()
pyplot.imshow(activation_map[0], interpolation='nearest')
#pyplot.show()
fig.set_size_inches(4, 4)
fig.savefig("./images/chapter-05-object-recognition-and-classification/convolution/example-edge-detection.png")

The output created from convolving an image with an edge detection kernel is all of the areas where an edge was detected. The code assumes a batch of images is already available (image_batch), with a real image loaded from disk. In this case, the image is an example image found in the Stanford Dogs Dataset. The kernel has three input and three output channels. The channels correspond to RGB values in the range \\([0, 255]\\), with 255 being the maximum intensity. The tf.minimum and tf.nn.relu calls are there to keep the convolution values within the range of valid RGB colors, \\([0, 255]\\).

There are many other common kernels which can be used in this simplified example. Each will highlight different patterns in an image, with different results. The following kernel will sharpen an image by increasing the intensity of color changes.


In [12]:
kernel = tf.constant([
        [
            [[ 0., 0., 0.], [ 0., 0., 0.], [ 0., 0., 0.]],
            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
            [[ 0., 0., 0.], [ 0., 0., 0.], [ 0., 0., 0.]]
        ],
        [
            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
            [[ 5., 0., 0.], [ 0., 5., 0.], [ 0., 0., 5.]],
            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]
        ],
        [
            [[ 0., 0., 0.], [ 0., 0., 0.], [ 0., 0., 0.]],
            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
            [[ 0., 0., 0.], [ 0., 0., 0.], [ 0., 0., 0.]]
        ]
    ])


conv2d = tf.nn.conv2d(image_batch, kernel, [1, 1, 1, 1], padding="SAME")
activation_map = sess.run(tf.minimum(tf.nn.relu(conv2d), 255))

In [15]:
# setup-only-ignore
fig = pyplot.gcf()
pyplot.imshow(activation_map[0], interpolation='nearest')
#pyplot.show()
fig.set_size_inches(4, 4)
fig.savefig("./images/chapter-05-object-recognition-and-classification/convolution/example-sharpen.png")

The values in the kernel were adjusted so that the center of the kernel is increased in intensity while the areas around the center are reduced in intensity. This change matches patterns of intense pixels and increases their intensity, outputting an image which is visually sharpened. Note that the corners of the kernel are all 0 and don't affect the output; the kernel operates in a plus-shaped pattern.

These kernels match patterns in images at a rudimentary level. A convolutional neural network matches edges and more by using complex kernels it learned during training. The starting values for the kernels are usually random, and over time they're trained by the CNN's learning layers. While a CNN is training, each image sent in is convolved with the kernels, which are then adjusted based on whether the predicted value matches the labeled value of the image. For example, if a Sheepdog picture is considered a Pit Bull by the CNN being trained, it will then change the filters a small amount to try to match Sheepdog pictures better.
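
A minimal sketch of what that looks like in code (an assumption for illustration, not part of the original example): the kernel becomes a tf.Variable with small random starting values, and an optimizer adjusts those values during training.

In [ ]:
# Hypothetical trainable kernel: 3x3, 3 input channels, 16 learned filters.
trainable_kernel = tf.Variable(
    tf.truncated_normal([3, 3, 3, 16], stddev=0.1))
learned_features = tf.nn.conv2d(
    image_batch, trainable_kernel, [1, 1, 1, 1], padding="SAME")
# During training, an optimizer (e.g. tf.train.GradientDescentOptimizer)
# would minimize a loss and update trainable_kernel's values.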

Learning complex patterns with a CNN involves more than a single layer of convolution. Even the example code included a tf.nn.relu layer used to prepare the output for visualization. Convolution layers may occur more than once in a CNN but they'll likely include other layer types as well. These layers combined form the support network required for a successful CNN architecture.


In [14]:
# setup-only-ignore
filename_queue.close(cancel_pending_enqueues=True)
coord.request_stop()
coord.join(threads)